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I .  Introduction 


Many  algorithms  require  the  calculation  of  a  sum 


n 

s  =  V  x ,  ,  n  >  3  , 
i=l  1 


where  x]/x2>  *  *  *>xn  are  numbers  represented  in  floating  point.  In 
practice,  an  approximate  sum  s  is  computed  with  rounding  errors. 
Wilkinson  [1]  shows  that  if  the  sum  is  accumulated  in  a  single-precision 
accumulator  (using  floating  binary  arithmetic  with  t  bits  of  precision 
and  proper  rounding) ,  then 

n 


s 


X.TJ. 

x 


> 


where 


(1  -  2-t)n+1-r 


<  1+  t)r  <  (1+  2_t)n+1" 


(r  =  1,  ...,n)  . 


Thus  the  error  bound  is  dependent  on  the  order  of  summation.  This 
result  has  led  to  the  well-known  rule  of  thumb  that  it  is  usually  best 
to  add  a  list  of  numbers  in  order  of  increasing  magnitude.  If  one  has 
a  priori  knowledge  of  the  x ^  (e.g.,  ][]  |xj  <  1  )  or  if  the  accumulation 

is  performed  with  more  precision  (say  double  precision),  then  much  smaller 
error  bounds  can  be  found.  However,  as  Wilkinson  points  out,  "It  should 
be  emphasized  that  we  still  cannot  guarantee  that  an  accumulated  sum  . . . 
has  a  low  relative  error." 

Large  relative  error  in  an  accumulated  sum  is  often  the  result  of 
a  phenomenon  which  Professor  D.  H.  Lehmer  calls  catastrophic  cancellation. 
This  occurs  when  an  intermediate  partial  sum  is  much  larger  in  magnitude 
than  the  final  sum.  Then  one  or  more  additions  result  in  a  loss  of 
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significant  digits.  The  post -normalization  step  of  a  subsequent 
addition  thus  introduces  zeros  in  place  of  significant  digits. 

However,  as  Professor  William  Kahan  has  observed,  this  large  cancellation 
is  not  the  cause  of  the  error  --  it  merely  reveals  the  error.  That  is, 
the  real  villain  here  is  not  the  cancellation,  but  rather  the  large 
intermediate  sums  within  a  floating-point  system  of  given  precision. 
Perhaps  "catastrophic  loss  of  precision”  would  be  a  more  appropriate 
name.  Catastophic  cancellation  is  fairly  common  with  poorly  designed 
algorithms;  most  good  algorithms  have  built-in  precautions  which  avoid 
(or  usually  avoid)  this  phenomenon. 

Large  relative  errors  can  occur  without  catastrophic  cancellation. 
This  happens  in  large  summations  (n  »  3)  where  the  intermediate  sums 
become  much  larger  in  magnitude  than  the  individual  addends,  but  not 
larger  than  the  final  sum.  This  sort  of  error  can  occur  in  numerical 
integration  using  a  large  number  of  intervals.  Wolfe  [2]  proposed  a 
technique  for  avoiding  this  type  of  error.  It  is  described  in  the 
following  section. 

In  the  remainder  of  this  report,  a  modification  of  Wolfe's  algorithm 
is  presented,  followed  by  a  detailed  error  analysis .  This  algorithm  has 
the  advantage  that  the  final  sum  is  guaranteed  to  have  a  very  small 
relative  error. 


II.  Extended  Summation  With  Cascading  Accumulators 

Wolfe  [2]  suggests  a  technique  which  is  easily  programmed  and  requires 
only  a  small  number  of  additional  storage  locations.  These  extra  locations, 
called  cascading  accumulators,  are  denoted  by  si,  s2,  ...  .  The  separate 
accumulators  hold  sums  that  are  in  various  intervals;  for  example, 
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1.000 

< 

c(sl) 

< 

9-999 

10.00 

< 

c(s2) 

< 

99-99 

100.0 

< 

c(s3) 

< 

999-9 

where  c(si)  denotes  the  contents  of  si  .  The  summing  is  done  at 
the  lowest  level  accumulator  (si)  until  it  is  about  to  overflow. 

At  that  point  it  is  added  to  the  next  accumulator  (s2)  and  reset  to 
zero.  Similarly,  if  s2  is  about  to  overflow,  it  is  added  to  s3  and 
reset  to  zero,  and  so  on. 

By  this  technique  the  intermediate  sums  never  become  much  larger 
than  the  addends.  However,  catastrophic  cancellation  can  occur  just 
as  before.  Wolfe  does  not  discuss  how  to  go  about  summing  the  accumulators 
at  the  end;  in  an  example  he  uses  the  order  of  increasing  magnitude. 

For  certain  problems,  this  is  a  useful  technique;  however,  there  is  no 
guarantee  that  the  final  result  has  a  small  relative  error. 

III.  A  Modification  of  Wolfe's  Algorithm 

The  following  algorithm  requires  little  if  any  more  execution 
time  than  the  algorithm  of  the  last  section,  and  nearly  full- 
precision  accuracy  is  achieved,  provided  exponent  underflow  or  overflow 
do  not  occur.  Such  exceptional  conditions  are  normally  brought  to  the 
attention  of  the  user  by  the  system  software  and,  if  so,  inaccurate 
results  cannot  go  unnoticed.  As  in  Wolfe's  algorithm,  additional 
intermediate  accumulators  are  used  —  typically  fewer  than  50. 
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The  following  discussion  assumes  the  algorithm  is  implemented 
on  a  machine  using  a  floating-point  number  system  F  of  base  3 
(usually  3  is  2,  8,  ID  or  16)  with  a  t -digit  mantissa.  The 
exponent  e  is  assumed  to  lie  in  the  range 

-m  <  e  <  M 

Thus  each  nonzero  xeF  has  the  normalized  representation 

x  =  +  .djdg. . .d^  •  3  ,  (l) 

where  ...,d^  are  integers  satisfying 
1  <  d1  <  3-1  , 

0  <  di  5  3-1  (i  =  2,  ...,t)  . 

The  number  0  belongs  to  F  ,  and  has  the  structure 

0  =  .00... 0  •  3"m  . 

All  floating-point  addition  is  assumed  to  be  normalized.  The  machine 
may  do  either  proper  rounding  or  truncation  (chopping). 

To  facilitate  discussion,  the  function  lev  (similar  to  that  used 
by  Miller  [5])  is  defined  as  follows:  If  xeF  then  tev(x)  =  e  +  m  . 
lev  is  the  biased  exponent  having  the  mnemonic  ''level*'.  Note  that  lev 
is  a  function  of  the  representation  of  a  number  and  not  the  number 
itself.  For  example,  suppose  x  is  to  be  added  to  y  ,  where  |y|  >  jx|, 
and  that  x  must  be  unnormalized  during  operand  alignment.  Suppose  also 
that  no  nonzero  digits  are  lost  from  the  mantissa  of  x  while  it  is  being 
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unnormalized.  If  we  denote  the  unnormalized  representation  of  x  by  x  , 
then  x  and  x  both  represent  the  same  real  number  exactly,  but 
lev(fc)  >  fev(x)  . 

n 

The  algorithm  for  computing  £  x.  can  now  be  described  as 

i=l 

follows.  There  are  two  positive  parameters,  I  and  tj  : 

Assume  there  are  t)+1  accumulators,  the  contents  of  which  are 
denoted  by  o^,a^,  ...,a  . 


1. 


2. 


3. 


Set  each  of  the  accumulators  tc  zero. 

For  each  xi  form  a^, ai2*  •  **>aiq  (<1  >  l)  >  where 

a..+a..+  ,..  +  a.  =  x .  and  each  a .  .  has  the  property 
il  12  iq  l  ij  *  *  J 

that  the  last  l  digits  are  0  (i.e.,  d^  =  d^  ^  = 

-°  )■ 

Each  a^  is  added  to  the  k-th  accumulator,  where  k  is 
determined  by 


vk  <  levCa^)  <  vk+  v  -  1  > 

v  =  T  (M+m+  l)/(r]  +  1) 


(2) 


where  [""  |  ~\  denotes  the  smallest  integer  not  less  than  I  . 
(Thus 

k  *=  levfa^)  tv,  (3) 

in  the  sense  of  Algol  60.) 

4.  The  accumulators  are  summed  in  decreasing  order  (i.e., 

1“1*  •  •  • 
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The  second  step  appears,  at  first  sight,  to  be  quite  complicated. 
However,  in  practice  it  is  easily  done,  especially  on  a  machine  with 
double-precision  arithmetic.  An  illustration  of  this  in  Fortran  for 
the  IBM  System/360  is  contained  in  Section  VI. 

The  parameters  I  and  tj  are  chosen  so  that  the  addition  in  step  3 
retains  all  the  significant  digits  involved.  That  is,  until  step  4,  there 
are  no  rounding  errors.  More  insight  into  choosing  l  and  t)  will  be 
given  in  the  following  section.  Also,  an  important  restriction  on  the 
magnitude  of  the  product  qn  will  be  revealed. 

Step  number  4  is  certainly  the  most  interesting  step  of  the  algorithm. 
If,  instead,  the  accumulators  are  summed  in  increasing  order  (as  one  is 
tempted  to  do  after  reading  Wilkinson  [1]),  catastrophic  cancellation  can 
occur.  When  this  algorithm  is  incorporated  in  an  innezproduct  routine,  it 
often  happens  that 

V° 

<Vi  =  0 

“k  ="B 

Vi*0 

“k-2  =° 

“k-3  =° 

=  Ei 

°k-5  =  e2 
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where  iev(B)  -fev(e^)  >t  .  Slimming  in  increasing  order  will  yield  0  . 
However,  as  D.  Jordan  pointed  out  in  [4],  summing  the  accumulators  in 
decreasing  order  (q, ...,0)  precludes  the  chance  of  this  type  of  error. 
The  remaining  question  is :  Does  summing  the  accumulators  in  decreasing 
order  lead  to  some  other  case  where  a  large  relative  roundoff  error  can 
occur?  The  answer  to  this  question  is  no.  Proof  of  this  assertion  and 
a  sharp  bound  on  the  roundoff  error  are  given  in  the  next  section. 


IV.  Error  Analysis 

Another  convenient  function  is  defined  as  follows:  Let  xeF  be  an 

■ft  *  * 

approximation  of  some  real  number  x  .  If  x  =  x  and  x  /  0  ,  then 

pad(x,x  )  is  defined  to  be  the  number  of  digits  by  which  the  mantissa 

of  x  can  be  shifted  to  the  right  before  a  significant  digit  is  lost 

(i.e.,  before  a  non-zero  digit  is  shifted  out  of  the  low-ordex*  position). 

*  * 

If  x  f  x  ,  then  pad(x,x  )  is  negative,  and  defined  as  follows: 
suppose  x  has  the  representation  (l);  if  there  exists  a  £  such  that 
£  can  be  represented  as 

C  =  +  .djdg. .  .d^  •  0®"*  with  ^  t  0  ,  /  0 

*  * 

and  x+£  =  x  and  T  is  finite,  then  pad(x,x  )  is  defined  to  be  -T  . 

/ 

Otherwise,  pad(x,x  )  is  defined  as  -  ®  .  For  completeness, 
pad(0,0)  =  °°  .  For  example,  if  0  =  2  and  t  =  6  ,  pad(  -  .101000  •  2  ,  -5^) 
If  x  =  +.111111  •  2°  and  y  =  +  .111111  *  2^  ,  and  ©  represents  floating¬ 
point  addition,  and  pad(x®y,x  +  y)  =  -2  since  twe  digits  are  lost  during 
the  floating-point  addition.  When  pad(x,x  )  is  positive,  the  mantissa 
of  x  has  a  "padding"  of  zero  digits  at  tbe  end. 
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It  follows  that 


pad(x,x  )  >  0  »  x  =  x  , 

*  * 
pad(x,x  )  <  0  «  x  /  x  , 

*•  -#- 

pad(x,x  )  >  t  =5  pad(x,x  )=®  »  x=x  =  0. 

In  step  2  of  the  algorithm,  it  is  required  that  pad(a^,a^)  >  I  >  0  , 
for  all  i,j  . 

It  is  also  expedient  to  define 
*£  * 

p(x,x  )  =  lev(x)  +  pad(x,x  )  .  (4) 

* 

p(x,x  )  is  invariant  with  respect  to  operand  alignment  (un-normalization) 
and  post-nomalization  of  x  ,  provided  no  exponent  underflow  or  overflow 
occurs . 

Lemma  li  If  x  and  y  are  two  floating-point  numbers  and  ®  represents 

floating-point  addition,  then 

*  *  *  * 

p(x®y,x  +  y  )  >  min{p(x,x  ),p( y,y  ))  , 

provided  no  exponent  underflow  or  overflow  occurs. 

Proof:  Assume  lev(x)  >  lev(y)  .  Let  z  denote  an  accumulator, 

a  floating-point  number  with  a  t+2  digit  mantissa  and  an 
overflow  digit.  Set  z  •-  y  and,  if  necessary,  unnorraalize  z 
so  that  lev(z)  =  lev(x)  .  The  accumulator  z  can  be  treated 
as  a  floating-point  number  if  one  ignores  the  overflow  digit  and 
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considers  only  the  first  t  digits  of  the  mantissa.  In  this 
way,  pad  is  defined  for  z  .  Let  w  denote  another 
accumulator  with  the  same  structure  as  z  .  Set  w  »-  z+x  . 
Prior  to  the  post -normalization  step  in  forming  w  , 

lev(w)  =  lev(z)  =  lev(x)  and 

pad(w,x  +  y  )  >min{pad(z,y  ),pad(x,x  ))  , 

and  equality  occurs  whenever  the  low-order  digits  of  x  and  y 
don't  cancel.  From  Equation  (k)  and  the  fact  that 

-X-  ¥• 

p(w,x  +y  )  remains  unchanged  during  the  post -normalization 
step,  it  follows  that 

p(x©y,x  +  y  )  =  p(w, x  +y  )  >min{p(z,y  ),p(x,x  )}  . 

*  * 

Since  p(z,y  )  =  p(y,y  )  , 

p(x©y,x  +  y  )  >  min{p(x,x  ),p(y,y  )}  . 
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Lemma  2:  In  the  notation  of  Section  III, 

p(<V<\)  >  vk  +  I  . 


Proof:  Any  term  (y)  that  is  added  to  the  k-th  accumulator  satisfies 

fev(y)  >  vk  and  pad(y,y)  >  I  . 

By  Lemma  1,  Equation  (U)  and  the  fact  that  p(0,0)  =  ®  ,  the  lemma 
follows  by  induction. 


Lemma  3:  If  /  0 


and  N  is  the  number  of 


added  to  the 


accumulators,  then 


vk  <  lev(ak)  <  v(ktl)  -  1  ,  if  N  =  1 

vk-t  +  l  +  1  <  fev(a^)  <  v(k*l)  +  |_  /ogp(N-l)  J  ,  if  N  >  1 

( k  =  0, 1,  •  •  • ,  T]) 

where  |_  I  J  denotes  the  largest  integer  not  greater  tnan  I  . 

Proof:  The  inequalities  for  N=1  are  obviously  the  same  as  those  satisfied 

by  a  single  term  added  to  the  k-th  accumulator.  The  upp3r  bound 
for  N  >  1  is  found  by  considering  the  largest  number  ({)  which 
can  be  added  to  the  k-th  accumulator,  i.e., 

{  =  +.zz. .  .zOO.  ..0  •  pv(k+1)‘1_m  f  (z  =  p-l) 
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and  observing  the  fevCa^)  during  repeated  additions  of  £  to  . 
If  the  lover  bound  for  N  >  1  were  not  true,  i.e.,  if 

levfa^)  <vk-t+l  +  l  , 

then,  by  Lemma  2, 

pad^,^)  >  t  *  Q^  =  0  , 

which  is  a  contradiction. 

Lemma  4:  If  N  <  £*_v+1  ,  then  each  of  the  0^  (k  =  0,1,  ...,7])  is 
exact. 

Proof:  By  Lemma  3> 

levCa^)  <  vk  +  f  +  1  . 

Combining  this  with  Equation  (U)  and  Lemma  2  gives 
pad(QCk,ak)  -  0  * 

which,  by  the  definition  of  pad  ,  implies  is  exact. 

Loss  of  precision  in  an  extended  summation  can  result  from  either 

1.  repeated  truncations  (roundings)  of  the  sum,  or 

2.  post -normalization  left  shift  of  the  approximate  sum. 

The  post-nomalization  error  can  be  formalized  as  follows:  Let  the 
accumulation  of  the  floating-point  sum 

where  the  x ^  (i  =  1,2, . .  .,n)  are  exact,  be  defined  as 
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*0  -  0 

*i  -  *i_i  ®  *i  »  (i  =  1,2,  ...,n)  , 


The  function  A  of  two  floating-point  variables  is  defined  as 

A(x,y)  ^  max{lev(x),lev(y)}  -  £ev(x©y) 

Thus,  during  post -normalization  of  the  floating-point  sum  x©y  , 
the  mantissa  undergoes  a  left  shift  of  A(x,y)  digits.  Clearly,  if 
a  carry  occurs,  A(x,y)  =  -1  .  Also,  A(x,y)  >  1  only  if 
j  £ev(x)  -  iev(y) |  <  1  . 

During  the  formation  of  \(i\  =  \|r^  ^®x^  ,  any  truncation  error 
already  present  in  the  low  order  digits  of  ^  1  is  multiplied  by 
A(*i-l,Xi) 


P 


The  accumulators  are  summed  in  decreasing  order.  Thus,  the  sum 


SQ  can  be  defined  by 


=  0 

7}+l 


Sk  -  (k  -  T)-  T)-l,  . .  .,0) 


U) 


Lemma  5:  If  S,_  is  exact  and 

- c  k+1 

is  un-normalized  so  that  !ev(S 

* 

PadC^i'Sfcfi.)  >0  *  Provided 


^ev(Sk+i)  <  tevia^.) 
k+1)  =  teviOy)  ,  then 


and  if 


S 


k+1 


Proof;  Lemma  1  and  Equation  (5)  yield 

p^Sk+l,Sk+l^  -  min^p^0Wl,ak+l^,p^a]c+2,ak+2^  '  *  ',p^°tyCV^  > 
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which,  with  Lemma  2  and  the  definition  of  p  ,  gives 


*ev(sk+1)  +  p«Ld(s1^1,  W  >  + '  • 

Substituting  iev(ak)  for  lev(Skf^)  and  using  Lemma  3,  we 
obtain  the  desired  result: 

pad(Sk+1,S*+1)  >  l  -  |_  iogp(N-l)  J  . 


Suppose  that,  in  the  process  of  summing  the  accumulators  as 
described  by  Equation  ( 5)  ,  k  =  j  is  the  first  k  such  that 
padi'S  ,S  )  <  0  .  As  a  result  of  Lemma  5>  the  truncation  (rounding) 

K  K 

qyco:  in  adding  S  ana  a.  must  be  caused  in  one  of  three  ways: 

3  3 

i)  When  the  operands  are  aligned,  pad(ct  A><XA)  =  0  and  a  carry 


J  J 


occurs . 


ii)  When  the  operands  are  aligned,  pad(S^+1,S^+1)  =  0  and  a 
carry  occurs. 

iii)  iev(S-  )  >  fev(a.)  and,  when  a  is  aligned  so  that 

J  J  J 

^ev(aj)  =  iev(SJ+1)  ,  pad(aL,a*)  <  -  A(SJ+1,aj)  <  0  . 


Lemma  6: 


£ev(S  .)  >  vj  +  i  +  1  . 
3 


Proof:  Case  i)  In  the  aligned  position,  using  Lemma  2,  we  find  that 

fev(a.)  =  p(a.,a*)  >  vj  +  l  . 

3  3  3 

Thus  £ev(S .)  =  £ev(a.)  +  l>vj  +  £  +  l. 

3  3 

Case  ii)  lev(S.+1)  =  p(Sj+1>Sj+1)  >  min{p(a..+1,a*+1) , . .  .,p(a  ,a*)) 

>  v  j  +  v  +  I 
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Therefore 


lev(Sj)  >vj  +  v+l  >  vj  +  i  +  1  . 


Case  iii) 


When  a. 
J 


is  aligned, 


fev(SJ+1)  =  lev( a  )  >  p(a  ,a*)  >  vj  +  I 


Now, 

,w<V  ■  +  A(sj+i>aj> 


>  iev( Sj+1)  >  iev(aj)  =vj  +  i  . 

Thus, 

f  ev(Sj)  >  vj  +  I  +  1 

Since  tev( a,  ,)  <  vj  +  |_logQ(N-l)  J  , 

'  ,1-1  ~  P 

£ev(S^.)  -  £ev(OL  >  l  +1-  |_|0gp(N-l)  J  . 

I  -vf  1 

Tlie  assumption  in  Lemma  4  (i.e.,  N  <  P  )  is  sufficient  to  guarantee 
that  fev(S.)  -iev(a._  )  >1  ,  from  which  it  follows  that  A(S  ,0^  )  <  1  . 

Similarly,  each  of  the  subsequent  additions  can  undergo  a  post¬ 
normalization  left  shift  of  at  most  one  digit.  In  fact,  at  most  one  of 
the  additions 


^  (fc  -  j ”1>  j "2,  •  • 0) 

will  undergo  a  post -normalization  left  shift  of  one  digit. 

£  -v+1 

Lemma  7 :  If  N  <  p  ,  the  mantissa  of  each  of  the  accumulators 

oc  ^  .  ..,aQ  is  shifted  at  least  t  digits  during  operand 

ali g nm ent ..  where 


lU 


k  =  r  (t+D/vi 


(6) 


Proof:  By  Lemma  6, 


lev(Sj_^)  >  vj  +  1  ,  (i  =  0,1, ...,j) 

By  Lemma  3, 

fev(a^_^)  <  v(j-i)  +  1  +  1  ,  (i  =0,1,  , 

and  fev(Sj  -  lev(o£j_^)  >  vi  -  l>t  ,  (i  =  \,\+l,...,j)  . 

Thus  the  mantissa  of  each  of  the  accumulators  a  ,a.  ,  ...,ol 

J-A.  J-A.-1  0 

is  shifted  at  least  t  digits  during  operand  alignment. 

Theorem  1:  If  Kg'  ,  if  the  accumulator  used  in  accumulating  SQ 
has  at  least  t+1  digits,  and  if  no  underflow  or  overflow  occurs, 
then  the  absolute  error  in  SQ  is  bounded  by 

!ev(S  )-m-t+l 

|s-S0|  <^6p 

where 

|  1  for  chopped  arithmetic 

5  -  <  i 

1  "  for  rounded  arithmetic 
and  \  is  given  by  Equation  (6). 

Proof:  Since  a  post-normalization  left  shift  of  at  most  one  digit  can 

occur  only  once  while  the  accumulators  are  summed,  the  worst  case 

occurs  when  it  is  caused  by  the  addition  of  a,  ^  (see  Lemma  7)« 

Subsequent  additions  of  •  •  •  cannot  affect  the  computed 

value  of  SQ  (see  Knuth’s  [6]  discussion  of  problem  5 >  page  ^9^)  • 

Prior  to  the  addition  of  a.  ^  ,  a  maximum  of  \  truncations  (roundings) 

!ev(Sn)-m-t 

can  occur,  each  resulting  in  an  error  of  (3  u  or  less. 

Q.E.D. 
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For  machines  which  use  t-digit  accumulators  and  chopped  arithmetic, 

I  ev(  )  -m-t+1 

the  error  bound  is  (\-l)0  (a  stronger  result!).  Note 

that,  although  the  above  theorem  gives  a  bound  on  the  absolute  error,  it 
also  provides  a  bound  on  the  relative  error.  Specifically,  if  the  true 
value  of  the  sum  s  is  zero,  then  Sq  =  0  . 


Theorem  2: 


where 


»  V/+1 

If  N  <  0  and  no  underflow  or  overflow  occurs,  then 

s  =  S0(l+e)  , 

M  <  • 


Proof:  i)  If  SQ  =  0  ,  then,  since  total  cancellation  of  significant 

digits  cannot  occur  in  summing  the  accumulators,  s  =  0  . 

ii)  If  Sq  /  0  ,  then  assume  s  -  SQ  *  SQe  .  By  Theorem  1, 

fev(SQ) -m-t+1 

l»-S0|  =  |S0|  |e|  <*»* 


Since  SQ  >  0 


iev(S0)-m-l 


e  <  X.  6  0 


2-t 


These  theoretical  results  are  substantiated  by  an  experiment 
reported  by  D.  Jordan  [4].  Jordan  used  this  technique  for  accumulating 
innerproducts  on  an  IBM  360  (0  =  1 6,  t  =  lk) .  He  chose  1]  =  32  ,  1=6 

and  q  =  2  ,  and  states: 
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"ESnpirical  tests  were  run  to  determine  the  small  amount  of  roundoff 
that  might  be  expected  from  the  procedure.  The  tests  used  1D00 

dot  products  of  15-camponent  vectors  where  the  components  were 

30  30 

randomly  generated  in  the  range  (-10  ,  10  )  .  The  results  of 

this  routine  were  checked  against  results  obtained  using  256 
hex-digit  arithmetic  chrough  the  multiple  precision  arithmetic 
package  written  by  J.  R.  Ehrman  of  SLAG.  Of  the  thousand  cases, 

467  were  in  exact  agreement,  537  had  an  erroneous  last  bit  and  3 
had  an  erroneous  penultimate  bit."  [4,  p.  3]. 

If  the  last  sentence  in  Jordan's  statement  were  changed  to  read 
"...  467  were  in  exact  agreement,  537  had  an  erroneous  last  (hexadecimal) 
digit  and  3  had  an  erroneous  penultimate  digit.",  then  Jordan's  results 
are  consistent  with  those  of  Michael  Saunders  at  Stanford  University. 
Saunders  performed  several  experiments  on  an  IBM  360  using  £  =  16, 
t  =  14,  71  =  43,  l  =  6  and  q  =  2  .  He  found  examples  where  the  13-th 
hexadecimal  digit  of  the  result  was  in  error,  but  none  with  errors  in 
the  12-th  digit.  Errors  of  this  size  are  consistent  with  Theorem  2. 


V.  Additional  Modifications  to  the  Algorithm 

If  one  desires  the  final  floating-point  result  to  be  correct  in 
all  digits,  the  following  procedure  can  be  used  immediately  after 
calculating  SQ  : 

1.  Form  a.,a0,  ...,a  (q  >  l),  where  a,  +  a_  +  . . .  +  a  =S. 

12*  q  -  •*  12  q  0 

and  each  a^  has  the  property  that  the  last  l  digits  of 
its  mantissa  are  0  . 

2.  Add  -a^ -a2,  . . . , -a^  to  the  accumulators. 

3.  Sum  the  accumulators  in  decreasing  order.  Call  the  result  A  . 

4.  S0  +  A  is  the  full  precision  result. 
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In  problems  where  N  may  get  arbitrarily  large  (e.g.,  numerical 

integration),  all  is  not  lost.  One  merely  increments  an  integer  every 

time  a  term  is  added  to  the  accumulators  and  when  the  integer  is  equal 
I  -v+1 

to  0  -m(Tftl)  ,  the  following  procedure  is  executed: 

1.  reset  the  integer  to  zero. 

2.  for  i  :=  0  step  1  until  T|  do 
begin 

a  :=  a. ;  a.  : =  0; 
i*  i 

addtoaccumulators  ( a) 

end 

where  the  procedure  addtoaccumulators  forms  the  a^,  ...,ai(^ 
variables,  adds  q  to  the  integer  and  adds  the  a^ 

(j  =  1, ...,q)  to  the  accumulators. 

3.  Resume  the  original  algorithm. 
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VI.  Conclusion 


The  generality  of  the  preceding  discussion  tends  to  obscure  the 

simplicity  of  the  algorithm.  For  this  reason,  a  simple  illustrative 

example  of  this  technique  programmed  in  Fortran  for  the  IBM  360  is  included: 

REAL  FUNCTION  SUM(X,N) 

EQUIVALENCE  (IEQ,W) 

DIMENSION  X(l) 

REAL*8  R(43),S8,DBLE 
DO  10  1=1,43 
10  R(I)=0.0D0 

DO  20  1=1,  N 
W=ABS(X(l)) 

IEXP=IEQ/ 50331648  +  1 

C  50331648  IS  3*(2**24)  WHICH  SHOTS  RIGHT  24  BITS  AND 
C  DIVIDES  BY  3.  ABS  GETS  RID  OF  THE  SIGN  BIT. 

20  R  ( IEXP)  =R  ( IEXP)  +  DBLE(X(I)) 

S8=0.0D0 
DO  30  1=1,43 
30  S8=S8  +  r(44-I) 

sum=sngl(s8) 

RETURN 

END 


This  subroutine  finds  the  sum  of  a  vector  of  short -precis  ion  (t  =  6) 
numbers.  Since  the  360  used  has  long-preciBion  (t  =  14)  floating-point 
hardware,  it  is  convenient  to  use  14  digit  accumulators  and  append  8  zero 
digits  at  the  end  of  each  .  Thus,  q  =  1  and  1=8.  The  value 
T)  =  43  was  chosen  to  give  v  =  3  •  Thus,  since  0  =  16  ,  the  short- 
precision  result  is  guaranteed  to  have  full-precision  (chopped) 
accuracy  provided  N  <  16^  =  16,777,216  . 
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This  example  is  typical  in  that  q  is  usually  small  (1  or  2  )  , 
l  is  usually  chosen  for  convenience,  and  T)  is  usually  chosen  so  the 
accumulators  can  be  quickly  indexed  and  so  v  is  sufficiently  small. 

The  algorithm  is  currently  used  in  several  innerproduct  routines 
(see  Malcolm  [3]  for  descriptions  of  these  routines,  including  running 
times) .  Since  efficiency  is  satisfactory,  it  may  well  be  feasible  to 
implement  this  technique  through  a  microprogram  so  that  the  programmer 
can  specify  by  a  certain  operation  code  that  a  summation  is  to  be 
performed  w±th  a  set  of  these  accumulators  rather  than  with  a  single 
accumulator. 
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